Tokenizer
Split text into word, sentence, paragraph, or document tokens.
Function
NaturalLanguage.tokenize(text, options?): TokenRange[]
Walk text at the requested granularity and return an ordered list of tokens.
options.unit?: "word" | "sentence" | "paragraph" | "document"— defaults to"word".options.language?: Language— optional hint that helps the tokenizer pick the right model for languages without explicit word boundaries (e.g. Chinese, Japanese, Thai).
Each token is:
Examples
Notes
range.location/range.lengthare UTF-16 offsets, matching native JS string indexing.text.substring(range.location, range.location + range.length)always returnstoken.text.- The tokenizer does not filter punctuation or whitespace — if you only want substantive words, filter the result yourself or use
Tagger.tags(..., { omitPunctuation: true, omitWhitespace: true }).
